ABSTRACT

Conventional multimedia annotation/retrieval systems, such as the Normalized Continuous Relevance Model (NormCRM), require fully labeled training data to perform well. Active Learning, by determining an order for labeling the training data, allows for good performance even before the training data is fully annotated. In this work we propose an active learning algorithm that combines a novel measure of sample uncertainty with a novel clustering-based approach for determining sample density and diversity, and we integrate it with NormCRM. The clusters are also iteratively refined to ensure both feature-level and label-level agreement among samples. We show that our approach outperforms multiple baselines, both on a recent, open character animation dataset and on the popular TRECVID corpus, at the tasks of annotation and text-based retrieval of videos.

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: Clustering, Retrieval Models; H.5.1 [Multimedia Information Systems]: Video (e.g., tape, disk, DVI)

Keywords 

Active Learning; Clustering; Uncertainty; Informativeness 

1. INTRODUCTION

The ubiquity of multimedia content in our daily lives requires effective tools for multimedia annotation and retrieval. Multimedia annotation tools automatically annotate image or video content (samples) with text labels, called concepts, that specify different objects, events, etc. Most of these systems treat automatic annotation as a classification problem, in which a separate classifier is trained for each concept. Fewer approaches, however, explore the correlation between these concepts.

A typical multimedia retrieval system, on the other hand, ranks the multimedia samples by their relevance to the user’s text query. Generally, retrieval is done by comparing the query to the sample concept labels. Thus an exhaustive annotation of the samples is often a prerequisite for such retrieval systems.

The Normalized Continuous Relevance Model (NormCRM) is an example of a technique that allows direct retrieval of samples without having to annotate them. However, training this model (like many others) requires fully annotated data. The human effort required for concept annotation is significant, which raises an interesting research question: is there a way to achieve decent annotation/retrieval performance without requiring a fully annotated training dataset?

The community has taken to Active Learning to address this issue. Active Learning is a machine learning technique that interactively selects unlabeled samples and queries an oracle to provide labels for them. Such a system outputs an order for labeling the samples such that decent annotation/retrieval performance is achieved before all unlabeled data is queried. A typical active learning system consists of a learning engine, which performs the annotation/retrieval, and a sample selection engine, which determines the labeling order of the unlabeled samples.

In this work, we use NormCRM as the learning engine and propose a novel sample selection algorithm. We call this integrated system CRMActive and apply it to video annotation and video retrieval tasks. The algorithm uses a measure of informativeness for ranking unlabeled samples during active learning. This informativeness combines a new measure of sample uncertainty with a novel cluster-refinement based approach for determining sample density and diversity. Our experiments show that CRMActive outperforms a state-of-the-art approach and a random baseline.


2. PROPOSED APPROACH 

The Normalized Continuous Relevance Model (NormCRM) is a generative annotation/retrieval technique. Consider a video sample $I$ defined by an $M$-dimensional feature vector $r$, and let $V$ be the vocabulary of all concept labels (each concept is one word long). NormCRM defines the conditional probability of using a label word $w \in V$ to annotate the video $I$ as $P(w \mid r) = P(w, r)/P(r)$. Lavrenko et al. suggest that for annotation we pick the top-$k$ words with the highest $P(w_i \mid r)$, $i = 1, 2, ..., k$. For the task of retrieval using a query word $w$, we pick the top-$t$ videos with the highest $P(w \mid r_i)$, $i = 1, 2, ..., t$.
In both cases, the joint distribution of words and features $P(\mathbf{w}, r)$ is estimated from the training data by

$$P(\mathbf{w}, r) = \sum_{J \in T} P(J) \prod_{w \in \mathbf{w}} P(w \mid J) \prod_{i=1}^{M} P(r_i \mid J),$$

where $T$ is the set of training video samples and $\mathbf{w}$ is the set of words in question.
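The two ranking rules above can be illustrated with a minimal Python sketch. It assumes the posteriors $P(w \mid r)$ have already been computed by some model; the function names `annotate` and `retrieve`, the dictionary layout, and the toy numbers are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def annotate(posterior: Dict[str, float], k: int) -> List[str]:
    """Return the k words with the highest P(w | r) for one video."""
    ranked = sorted(posterior, key=lambda w: posterior[w], reverse=True)
    return ranked[:k]


def retrieve(query: str, posteriors: Dict[str, Dict[str, float]], t: int) -> List[str]:
    """Return the t video ids with the highest P(query | r_i)."""
    ranked = sorted(posteriors,
                    key=lambda vid: posteriors[vid].get(query, 0.0),
                    reverse=True)
    return ranked[:t]


# Toy posteriors P(w | r) for three videos (made-up numbers).
toy = {
    "v1": {"Arms": 0.6, "Legs": 0.3, "Face": 0.1},
    "v2": {"Arms": 0.2, "Legs": 0.7, "Face": 0.1},
    "v3": {"Arms": 0.4, "Legs": 0.1, "Face": 0.5},
}
top_words = annotate(toy["v1"], k=2)      # annotation of video v1
top_videos = retrieve("Arms", toy, t=2)   # retrieval for the query "Arms"
```

Both tasks reduce to sorting by the same posterior, which is why a single trained model can serve annotation and retrieval.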

However, NormCRM requires fully annotated data for training. To circumvent this, we integrate NormCRM into an Active Learning framework by combining it with a sample selection engine, which selects samples for annotation based on their informativeness. We calculate the informativeness by combining measures of sample uncertainty, density, and diversity.

Sample Uncertainty is a measure of how uncertain the learning engine is about the labels of a sample. With an SVM as the learning engine, entropy and the distance of a sample from the decision boundary have been explored as sample uncertainty measures [23]. However, these techniques do not capture the ambiguity between the relevant labels and the irrelevant ones for NormCRM-based models. Hence, we define a novel measure of uncertainty of an unlabeled sample (defined by an $M$-dimensional feature $x$) as:

$$unct(x) = \frac{1}{P(w_1 \mid x) - P(w_{k+1} \mid x)}, \quad (1)$$

where $w_1, ..., w_k$ (in decreasing order of relevance) are the top-$k$ most relevant labels assigned to $x$. The denominator in Eq. (1) gives a measure of the gap (distance) between the posterior probabilities of the most relevant label and the first irrelevant one, and can thus be used to obtain uncertainty.

Sample Density is a measure of how likely a certain sample is to occur given the underlying distribution that generated the data, while a high Sample Diversity score ensures that the samples chosen for labeling are not too similar to each other. To compute sample density and diversity, we start by clustering all samples in the training data $\mathcal{X} = \{x_1, x_2, ..., x_N\}$, consisting of the initial labeled training data $\mathcal{L}$ and the unlabeled training data $\mathcal{U}$ ($\mathcal{X} = \mathcal{L} \cup \mathcal{U}$). We first represent every sample in the visual feature space and perform X-Means clustering. X-Means is a variant of K-Means that automatically picks the parameter K by comparing the Bayesian Information Criterion (BIC) scores of the clustering for a range of values of K and picking the one with the optimal score. We then check whether every labeled sample shares a concept with at least one other labeled sample in the same cluster. A sample that shares no labels is removed from the cluster and used to create a new cluster, and the unlabeled samples of the original cluster are redistributed between the old and the new clusters using 2-Means.

In order to measure the extent of agreement among the labeled samples in a cluster, both in terms of their visual features and their labels, we use Empirical Entropy. For a cluster $C$, it is defined as:

$$h^{C} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\frac{1}{n} \sum_{j=1}^{n} K(x_i, x_j)\right), \quad (2)$$

where there are $n > 1$ labeled samples in the cluster and $K(\cdot, \cdot)$ is a kernel function. A kernel is a mapping $K: X \times X \to \mathbb{R}$, where $X$ is the input space. A kernel may be considered a measure of similarity. For continuous input spaces, such as video features, a Gaussian kernel is often used [24]:

$$K_{Gauss}(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2),$$

where $x, x' \in \mathbb{R}^M$. For discrete input spaces, such as the space of labels, a Bernoulli product kernel may be used [12]:

$$K_{Bern}(x, x') = \prod_{d=1}^{D} \left[\gamma_d^{x_d x'_d} \times (1 - \gamma_d)^{(1 - x_d)(1 - x'_d)}\right],$$

where $x, x' \in \{0, 1\}^D$, $x_d$ shows the presence (1) or absence (0) of the $d^{th}$ concept, and $\gamma_d$ is the probability of the $d^{th}$ concept occurring. In order to capture the notion of sample similarity from both the visual and the label perspectives, we define a new kernel as a combination of the two:

$$K(x, x') = K_{Bern}(x, x') \times K_{Gauss}(x, x').$$

Once we have clustered the sample videos, we compute the sample density of an unlabeled sample $x$ in cluster $C$ as

$$den(x) = \frac{p(x)}{\max_{x_i \in C} p(x_i)},$$

where $p(x)$ is the kernel density estimate:

$$p(x) = \frac{1}{|C|} \sum_{x_i \in C} K_{Gauss}(x, x_i),$$

and $|C|$ is the total number of samples in cluster $C$.
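As an illustration of the density score, the following Python sketch computes the Gaussian kernel density estimate of each sample in a cluster and normalizes it by the largest estimate in that cluster. The function names, the bandwidth `sigma = 1.0`, and the toy cluster are assumptions made for the example, not values from the experiments.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def gauss_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kde(x: Vector, cluster: List[Vector], sigma: float = 1.0) -> float:
    """Kernel density estimate p(x) over all samples of one cluster."""
    return sum(gauss_kernel(x, xi, sigma) for xi in cluster) / len(cluster)


def density(x: Vector, cluster: List[Vector], sigma: float = 1.0) -> float:
    """den(x): p(x) divided by the largest p(x_i) within the cluster."""
    peak = max(kde(xi, cluster, sigma) for xi in cluster)
    return kde(x, cluster, sigma) / peak


# Toy cluster: three nearby samples and one outlier.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
scores = [density(x, cluster) for x in cluster]
```

The normalization keeps every score in (0, 1], so a sample in the densest part of its cluster scores 1 and the outlier scores much lower.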


Algorithm 1 CRMActive

Input: The set $\mathcal{L} = \{l_1, l_2, ..., l_p\}$, their labels $Y = \{y_1, y_2, ..., y_p\}$ where $y_i \in \{0, 1\}^D$, the set $\mathcal{U} = \{u_1, u_2, ..., u_q\}$, and $K$, the number of samples to pick in a batch.

Output: The set $\mathcal{R}$, containing the order in which the unlabeled samples are labeled.

Algorithm:

Perform X-Means, using the visual features, on the set of $\mathcal{L} \cup \mathcal{U}$ samples. Let $T$ be the optimal number of clusters and let $rep(C_i)$ denote the representative sample of cluster $C_i$.

Check that every $l_j \in \mathcal{R}$, $l_j \in C_k$, shares $\geq 1$ concept with at least 1 other labeled sample in $C_k$; otherwise call Redistribute($C_k$, $l_j$).
$h_{worst}$ := NIL // Initialize $h_{worst}$
while $\mathcal{U} \neq \emptyset$ do
    Train NormCRM using $\mathcal{R}$, evaluate model on test set.
    Update $h_{worst}$ to the max. entropy value among all clusters with at least 2 labeled samples
    Compute $Info(x_i)$, $\forall x_i \in \mathcal{U}$
    Pick the top-$K$ samples, $Lab = \{a_1, a_2, ..., a_K\}$, for labeling.
    $\mathcal{R} := \mathcal{R} \cup Lab$, $\mathcal{U} := \mathcal{U} - Lab$ // Update the lists
    // Now refine the clusters based on newly labeled samples
    for $j = 1, 2, ..., K$ do
        if $h_{worst}$ = NIL then // If $h_{worst}$ not set
            Check that sample $a_j$, $a_j \in C_k$, shares $\geq 1$ concept with at least 1 other labeled sample in $C_k$; otherwise call Redistribute($C_k$, $a_j$).
        else // Determine which sample in $C_k$ to knock out
            Compute $h^{C_k}$ where $a_j \in C_k$ // $C_k$ has > 1 labeled sample
            if $h^{C_k} > h_{worst}$ then // Exceeds threshold
                for $r = 1, 2, ...,$ labeled samples in $C_k$ do
                    $C'_k := C_k - r^{th}$ labeled sample in $C_k$
                    if $h^{C'_k} \leq h_{worst}$ then // Meets threshold
                        $W := r^{th}$ labeled sample
                        Redistribute($C_k$, $W$) // Split cluster
                        break


Our definition of the sample density, though similar to that of Zha et al. [24], differs by using clusters, which are refined (see later in this section), to determine the neighboring samples of $x$ rather than a static set of its k-nearest neighbors.

To compute the sample diversity, we use the angular distance between features, similar to Brinker’s technique.









procedure REDISTRIBUTE(Samples in $C_k$, $a$)

Input: Set of all samples in cluster $C_k$ and the seed sample $a$
Output: Updated set of clusters

Algorithm:

Create a new cluster $C'_k$, with $a$ as the centroid.

Perform 2-Means on the unlabeled samples of cluster $C_k$ with $rep(C_k)$ and $a$ as the two initial cluster centroids.

Update $rep(C'_k)$ as the representative sample of cluster $C'_k$.
Determine the centroid of the labeled and the remaining unlabeled samples in $C_k$ and similarly update $rep(C_k)$.
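The 2-Means step of the procedure can be sketched in Python as below. This is a plain Lloyd-style iteration seeded with the old representative and the knocked-out sample; whether the original implementation keeps the seed centroids fixed or updates them is not stated, so updating them here is an assumption, as are the function names and the toy data.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points: List[Vector]) -> Vector:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def redistribute(unlabeled: List[Vector], rep: Vector, seed: Vector,
                 max_iter: int = 50) -> Tuple[List[Vector], List[Vector]]:
    """Split unlabeled samples between the old cluster and the new one.

    rep is the old cluster's representative; seed starts the new cluster.
    """
    c_old, c_new = tuple(rep), tuple(seed)
    old_part: List[Vector] = []
    new_part: List[Vector] = []
    for _ in range(max_iter):
        old_part = [p for p in unlabeled if sq_dist(p, c_old) <= sq_dist(p, c_new)]
        new_part = [p for p in unlabeled if sq_dist(p, c_old) > sq_dist(p, c_new)]
        next_old = centroid(old_part) if old_part else c_old
        next_new = centroid(new_part) if new_part else c_new
        if next_old == c_old and next_new == c_new:
            break  # assignments have stabilized
        c_old, c_new = next_old, next_new
    return old_part, new_part


unlabeled = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
old_part, new_part = redistribute(unlabeled, rep=(0.1, 0.0), seed=(5.0, 4.9))
```

Seeding with the two known samples avoids random initialization, so the split is deterministic for a given cluster.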


However, to gain speed we choose only the representative sample of every cluster (i.e., the sample closest to the cluster centroid), $rep(C)$, rather than all the samples in $\mathcal{X}$. The diversity of an unlabeled sample $x$ is thus defined as:

$$div(x) = 1 - \max_{x_i \in S} \frac{K_{Gauss}(x, x_i)}{\sqrt{K_{Gauss}(x, x) \times K_{Gauss}(x_i, x_i)}},$$

where $S$ is the set of all $T$ cluster representatives, $S = \{rep(C_1), rep(C_2), ..., rep(C_T)\}$.
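A minimal Python sketch of this diversity score follows. It evaluates the kernel-normalized similarity of a sample to every cluster representative and subtracts the largest one from 1. The helper names, the bandwidth `sigma = 1.0`, and the toy representatives are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def gauss_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def diversity(x: Vector, reps: List[Vector], sigma: float = 1.0) -> float:
    """div(x): 1 minus the largest normalized kernel similarity to any representative."""
    def similarity(a: Vector, b: Vector) -> float:
        norm = math.sqrt(gauss_kernel(a, a, sigma) * gauss_kernel(b, b, sigma))
        return gauss_kernel(a, b, sigma) / norm
    return 1.0 - max(similarity(x, r) for r in reps)


reps = [(0.0, 0.0), (4.0, 0.0)]          # toy cluster representatives
near = diversity((0.0, 0.0), reps)       # coincides with a representative
far = diversity((10.0, 10.0), reps)      # far from every representative
```

For the Gaussian kernel the normalizing term equals 1, because $K_{Gauss}(x, x) = 1$ for every $x$; the sketch keeps it only to mirror the general form of the definition.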

Now, we combine these measures to determine the informativeness of an unlabeled sample $x$ as

$$Info(x) = \lambda_1 \times unct(x) + \lambda_2 \times den(x) + \lambda_3 \times div(x).$$

We rank the unlabeled samples in order of decreasing $Info(x)$ score and select a batch of the top-$K$ samples for labeling. While Zha et al. use a combination of sample local structure, density, diversity, and relevance to score the samples [24], our approach differs, most notably, in the use of clustering and a novel uncertainty measure.
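The scoring and batch selection can be sketched in Python as follows. The uncertainty is the reciprocal of the gap in Eq. (1), and the informativeness is the weighted sum of the three measures. The small constant `eps` that guards against a zero gap, the equal weights, and all toy numbers are assumptions of this sketch, not details taken from the method.

```python
from typing import Dict, List, Tuple


def uncertainty(posteriors: List[float], k: int, eps: float = 1e-12) -> float:
    """unct(x): 1 / (P(w_1|x) - P(w_{k+1}|x)); eps avoids division by zero."""
    ranked = sorted(posteriors, reverse=True)
    gap = ranked[0] - ranked[k]  # most relevant label vs. first irrelevant one
    return 1.0 / max(gap, eps)


def info(unct: float, den: float, div: float,
         weights: Tuple[float, float, float]) -> float:
    """Weighted sum of uncertainty, density, and diversity."""
    l1, l2, l3 = weights
    return l1 * unct + l2 * den + l3 * div


def select_batch(scores: Dict[str, float], batch_size: int) -> List[str]:
    """Ids of the batch_size samples with the highest informativeness."""
    return sorted(scores, key=lambda s: scores[s], reverse=True)[:batch_size]


# Toy unlabeled samples: (label posteriors, density, diversity), made-up values.
candidates = {
    "a": ([0.50, 0.30, 0.15, 0.05], 0.9, 0.2),
    "b": ([0.40, 0.30, 0.20, 0.10], 0.5, 0.5),
    "c": ([0.70, 0.20, 0.05, 0.05], 1.0, 0.9),
}
weights = (1 / 3, 1 / 3, 1 / 3)
scores = {name: info(uncertainty(post, k=2), den, div, weights)
          for name, (post, den, div) in candidates.items()}
batch = select_batch(scores, batch_size=2)
```

In this toy run sample "b" ranks first because its posterior gap is the smallest, even though its density and diversity are unremarkable.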

Equation (2) reveals that a cluster with low inter-sample disagreement has a low entropy. As more samples in a cluster $C$ are labeled, the disagreement among its labeled samples increases. This changes the empirical entropy in a monotonically non-decreasing fashion. Therefore, we refine the clusters as follows. After each batch of labeling, the algorithm determines the cluster with the worst entropy and uses its entropy as a threshold to decide whether to keep or split a cluster during the next batch; this is repeated for successive iterations. If a newly labeled sample increases the cluster entropy beyond the threshold for that batch, then a grid search is used to determine the first labeled sample without which the cluster meets the entropy threshold. We create a new cluster with this sample and rearrange the unlabeled samples via 2-Means, as before (see Algorithm 1).
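The entropy test and the search for the sample to knock out can be sketched in Python as below. The kernel is passed in as a function; for brevity the example uses only a Gaussian kernel on one-dimensional features, which is a simplification of the combined feature-and-label kernel. The function names, the threshold value, and the toy data are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def gauss_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def empirical_entropy(samples: List[Vector], kernel: Kernel) -> float:
    """Eq. (2): -(1/n) sum_i log((1/n) sum_j K(x_i, x_j))."""
    n = len(samples)
    return -sum(math.log(sum(kernel(xi, xj) for xj in samples) / n)
                for xi in samples) / n


def first_removable(samples: List[Vector], kernel: Kernel,
                    threshold: float) -> Optional[int]:
    """Index of the first sample whose removal brings entropy within threshold."""
    for idx in range(len(samples)):
        rest = samples[:idx] + samples[idx + 1:]
        if empirical_entropy(rest, kernel) <= threshold:
            return idx
    return None  # no single removal meets the threshold


labeled = [(0.0,), (0.1,), (5.0,)]   # two agreeing samples and one outlier
threshold = 0.01                     # stand-in for the worst entropy of the last batch
before = empirical_entropy(labeled, gauss_kernel)
knocked_out = first_removable(labeled, gauss_kernel, threshold)
```

The scan stops at the first sample that satisfies the threshold, matching the break in Algorithm 1, so its cost is at most one entropy evaluation per labeled sample in the cluster.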


3. EXPERIMENTS 

We conduct two sets of experiments. In each set, the experimental dataset is divided into training and test subsets. For the first set of experiments, the task of an algorithm is to annotate a test video with a subset of concepts from the vocabulary. The algorithm starts with the training dataset divided into labeled ($\mathcal{L}$) and unlabeled ($\mathcal{U}$) parts. Initially, only a small subset of the training set is considered to be labeled. The algorithm uses this information to annotate the test set with concept labels. In the next step, the algorithm selects a batch of $K$ unlabeled training samples, we reveal the labels for the selected samples, and the algorithm repeats the annotation task. For every iteration, we compute the precision scores of the algorithm on the test set for each concept and report their average. We call this score AP.
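The following Python sketch shows one way to compute this score: precision is calculated separately for each concept over the test set and then averaged across the vocabulary. Treating a concept that is never predicted as having precision 0 is an assumption of the sketch, as are the function name and the toy annotations.

```python
from typing import List, Set


def mean_concept_precision(predicted: List[Set[str]],
                           truth: List[Set[str]],
                           vocabulary: List[str]) -> float:
    """Average over concepts of (correct assignments / total assignments)."""
    per_concept = []
    for concept in vocabulary:
        assigned = sum(1 for p in predicted if concept in p)
        correct = sum(1 for p, t in zip(predicted, truth)
                      if concept in p and concept in t)
        # Assumption: a concept that is never predicted contributes 0.
        per_concept.append(correct / assigned if assigned else 0.0)
    return sum(per_concept) / len(per_concept)


vocabulary = ["Arms", "Legs", "Face"]
predicted = [{"Arms", "Legs"}, {"Arms"}, {"Face"}]   # model output per test video
truth = [{"Arms"}, {"Arms", "Legs"}, {"Legs"}]       # ground-truth labels
ap = mean_concept_precision(predicted, truth, vocabulary)
```

Here "Arms" is predicted twice and correct both times, while "Legs" and "Face" are each predicted once and wrong, giving a score of one third.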

In the second set of experiments, an algorithm ranks the test samples by their similarity to a single-word query without annotating the test samples. Again, the algorithm starts with the training dataset divided into labeled and unlabeled parts. For each concept label in the vocabulary, the algorithm ranks the test samples by their similarity to the concept. It then selects a batch of $K$ unlabeled training samples, we reveal the labels for the selected samples, and the algorithm repeats the ranking task. For each round, we report the AP scores for the top 5 images/videos.

3.1 Datasets 

TRECVID 2007: The TRECVID 2007 video corpus has 110 short video clips. Each frame in every video is annotated with at most 16 concept labels selected from a set of 36 concepts such as “crowd”, “building”, “airplane”, etc. This corpus has been used extensively in video annotation experiments. In recent multimedia recognition/annotation tasks, histograms have been found to be effective as a feature summarization technique for text content, acoustic content, and images/video. Therefore, for every frame we compute a 225-dimensional feature vector (color moment, edge orientation histogram, wavelet PWT/TWT texture) as described in the work of Zha et al. [24]. We test our model on the frames from 13 randomly selected videos and use the rest of the data (frames from 97 videos) for training. We selected 4000 frames from the training data as the initial set of labeled samples $\mathcal{L}$, containing at least one positive example of every concept. We set the batch size $K$ to 2400.

USC SmartBody: SmartBody is an open virtual character animation platform. It ships with a library of 274 animations such as walking, hand beat gesture, pointing, eyebrow raising, lip corner stretching, etc. The animations are defined on a 3D skeleton consisting of 119 individual joints, and the 3D coordinates of these joints are available from the SmartBody API. Each animation is annotated using at most 6 concept labels from a set of 30 labels such as “Legs”, “Arms”, “Face”, “Left”, “Right”, etc. The X-axis of Figure 2 gives an exhaustive list of all the concepts. The animations are annotated at the video clip level (i.e., the individual frames are not annotated). Nine of the 119 joints have been handpicked for feature computation (neck, left (L)/right (R) shoulders, L/R elbows, L/R hip joints, and L/R knees). For each frame in an animation, the skeleton angles at these joints are computed, and the differences between the minimum and the maximum values of the angles over the whole animation sequence are encoded as a 9-dimensional feature vector. This dataset, called the USC SmartBody Annotation-Retrieval Dataset (SARD), has recently been made available to the community for research. We randomly selected 24 animations for testing and use the rest of the data (250 animations) for training. We selected 40 animations from the training data as the initial set of labeled samples $\mathcal{L}$, containing at least one positive example of each concept. We set the batch size $K$ to 23.
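The range-based animation feature described above can be sketched in Python as follows. The input layout (one list of joint angles per frame) and the function name are assumptions of the sketch, and the example uses three joints instead of nine to stay short.

```python
from typing import List, Sequence


def range_feature(angles_per_frame: List[Sequence[float]]) -> List[float]:
    """Per-joint difference between the max and min angle over all frames."""
    n_joints = len(angles_per_frame[0])
    return [max(frame[j] for frame in angles_per_frame)
            - min(frame[j] for frame in angles_per_frame)
            for j in range(n_joints)]


# Toy animation: 3 frames, 3 joints, angles in degrees (made-up values).
frames = [
    [10.0, 90.0, 45.0],
    [30.0, 85.0, 45.0],
    [20.0, 100.0, 45.0],
]
feature = range_feature(frames)
```

A joint that never moves contributes 0, so the vector summarizes how much each selected joint is exercised during the clip, independent of the clip length.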

3.2 Baseline Systems 

For the annotation task, we compare CRMActive with two methods. The first is an active learning system that uses NormCRM as the learning engine while the samples are selected randomly. Its results are averaged over 3 runs with different random seeds. The second baseline is the state-of-the-art method proposed by Zha et al. [24]. We determine the two NormCRM smoothing parameters $\lambda$ and $\beta$, and the validated parameters of the second baseline, using 10-fold cross-validation on the first annotation batch. These values are then fixed for successive rounds. The values of the fixed parameters for the second baseline are reused from the paper [24]. For CRMActive, the probability $\gamma_d$ is re-estimated from the labeled training data on each annotation batch, and the weighting parameters are set to $\lambda_i = 1/3$, $i = 1..3$. Finally, both NormCRM and CRMActive work by ranking annotation concepts, so we assign the top-16 concepts for TRECVID 2007 and the top-6 for SmartBody as relevant. For direct retrieval, CRMActive is compared only with the first baseline discussed above, since no prior work is known.


Sample Frames from a Query Video

Ground Truth Annotation: Arms Raise Left

CRMActive Annotation Round 0 (Top-6): Arms High Legs Shoulder Headturn Run

CRMActive Annotation Round 7 (Top-6): Arms Raise Left Headturn Run High

Figure 1: A sample annotation result on the SmartBody dataset, showing the top-6 labels annotated by CRMActive after Round 0 and Round 7.

Table 1: AP scores (on Y-axis) for annotation on TRECVID (a), SmartBody (b) and AP scores (on Y-axis) for retrieval of top-5 videos on TRECVID (c), SmartBody (d).

3.3 Results and Discussion 

The results in Table 1 show that both NormCRM-based models, i.e., the first baseline (NormCRM) and CRMActive, generally perform better than the Zha et al. approach for annotation. We believe this is because NormCRM captures the inter-label correlation, while Zha et al. train individual classifiers for every concept. Also, the NormCRM-based systems jointly model the labels and features, which allows them to capture patterns from both perspectives; this is again not the case for Zha et al. Furthermore, CRMActive, by selecting the more informative samples first, trains a more robust model early on, which results in its monotonically non-decreasing AP score for annotation/retrieval. This is in contrast with the occasional dips in the AP scores of the random baseline, which might select some of the relatively “bad” (noisy) training samples early on. Figure 1 shows a sample annotation result on the SmartBody dataset using CRMActive. We see that the model gets all top 3 labels correct at Round 7, even before the training data is fully annotated.

Figure 2 shows the annotation performance of all the models for the individual concepts of the SmartBody dataset over two rounds (initial and towards the end). The concept scores for the NormCRM random baseline are obtained by averaging over the results of the 3 runs. We notice a performance gain for all the models across most concepts over the two rounds, indicating that more training data helps.


Figure 2: Precision scores for annotation of individual concepts of SmartBody for Round 0 (R0) and Round 7 (R7) of active learning. Series shown: Zha et al., NormCRM, and CRMActive, each at R0 and R7.


We also notice that CRMActive is always at least as good as the baselines on all concepts. For concepts with a high number of positive examples, such as Legs, all models do well. Further, we believe that the nature of the features used can explain the good performance of all models on complex concepts such as Dance, compared to others such as Mouth.

4. CONCLUSIONS 

In this work, we proposed a sample selection algorithm based on active learning that combines a novel measure of sample uncertainty with a novel cluster-refinement approach for determining sample density and diversity. This approach is shown to outperform multiple baselines at both annotation and retrieval tasks. Our experiments also reveal the advantages of a generative approach that jointly models both the features and the labels. CRMActive is thus shown to be a promising active learning approach to explore.